Questions
- How can I create publication-quality graphics in R?
Objectives
- To be able to use ggplot2 to generate publication-quality graphics.
- To understand the basic grammar of graphics, including the aesthetics and geometry layers, adding statistics, transforming scales, and coloring or panelling by groups.
Plotting the data is one of the best ways to quickly explore it and generate hypotheses about various relationships between variables.
There are several plotting systems in R, but today we will focus on ggplot2, which implements the grammar of graphics: a coherent system for describing the components that constitute a visual representation of data. For more information on the principles and thinking behind the ggplot2 graphics system, please refer to Layered grammar of graphics by Hadley Wickham (@hadleywickham).
The advantage of ggplot2 is that it allows R users to create publication quality graphics with just a few lines of code. ggplot2 has a large user base and is constantly developed and extended by the community.
ggplot2 is a core member of the tidyverse family of packages. Installing and loading the tidyverse package will load all of the packages we will need for this workshop. Let’s get started!
Add a new chunk to this R Markdown Notebook by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
# install.packages("tidyverse")
# install.packages("gapminder")
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2.9005 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
If the code above produces the error “there is no package called ‘tidyverse’”, uncomment the install.packages() lines above (remove the #) and run them before you load the library. You only need to install a package once, but you will have to reload it, using the library() command, every time you restart R.
Today we will be working with the gapminder dataset, which is an excerpt from the Gapminder data. First, let’s load the data.
gapminder <- gapminder::gapminder
You can look at the contents of the gapminder data frame by simply typing gapminder either in an R chunk or in the console. A data frame is a rectangular collection of data, where variables are organized as columns and observations are listed as rows.
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
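The dataset contains the following fields:
- country: name of the country (factor)
- continent: continent the country belongs to (factor)
- year: year of observation, from 1952 to 2007 in steps of five years
- lifeExp: life expectancy at birth, in years
- pop: population
- gdpPercap: GDP per capita, in inflation-adjusted US dollars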
More information about the package and the data is available in help. Just type ?gapminder in console, located in the bottom panel of your RStudio, or type gapminder in the search field of the Help tab of the bottom-right RStudio panel. Whenever you are unsure about anything in R, it is a good idea to check out the help file using one of the two methods described above.
Here’s a question that we would like to answer using the gapminder data: Do people in rich countries live longer than people in poor countries? The answer may be quite intuitive, but we will take the investigation further: what does the relationship between GDP per capita and life expectancy look like? Is this relationship linear? Non-linear? Are there exceptions to the general rule (outliers)?
To plot gapminder, run the following code in the R-chunk or in console. The following code will put gdpPercap on the x-axis and lifeExp on the y-axis:
ggplot(data = gapminder) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp))
Note that we split the command across two lines. The plus sign indicates that the command is not over yet and that the next line should be interpreted as an additional layer added to the preceding ggplot() call. In other words, when a ggplot() command spans several lines, the + sign goes at the end of a line, not at the beginning of the next one.
The plot shows a positive, non-linear relationship between GDP per capita and life expectancy.
Does this graph confirm or disprove your initial hypothesis about the relationship between these variables?
Note that to create a plot with the ggplot2 system, you start your command with the ggplot() function. It creates an empty coordinate system and initializes the dataset to be used in the graph (supplied as the first argument to ggplot()). To create a graphical representation of the data, we add one or more layers to the otherwise empty graph. Functions starting with the prefix geom_ create a visual representation of data. In this case we added scattered points using the geom_point() function. There are many geoms in ggplot2, some of which we will learn in this lesson.
Each geom_ function maps variables from the previously defined dataset to aesthetic elements of the graph, such as axes, shapes or colors. The first argument of any geom_ function expects these mappings, wrapped in the aes() (short for aesthetics) function. In this case, we mapped the gdpPercap and lifeExp variables from the gapminder dataset to the x- and y-axis, respectively (using the x and y arguments of aes()).
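The layering idea can be checked directly. The following is a minimal sketch, assuming the ggplot2 and gapminder packages are installed; the object names p_empty and p_points are ours, chosen for illustration. It stores the plot in a variable and counts its layers before and after a geom is added.

```r
library(ggplot2)

gapminder <- gapminder::gapminder

# ggplot() alone only registers the data and sets up an empty coordinate system
p_empty <- ggplot(data = gapminder)
length(p_empty$layers)   # no layers yet

# adding a geom_ function adds one layer that draws the data
p_points <- p_empty +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp))
length(p_points$layers)  # one layer

# printing the object (or typing its name) draws the plot
p_points
```

Storing a plot in a variable like this is optional, but it can make it easier to try different layers on top of the same base.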
Generally speaking, the template for visualizing data in ggplot2 can be summarized as follows:
`ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))`
In the remainder of this lesson we will learn how to extend and complete this template using different elements to produce various visualizations. First, we will look closer at the <MAPPINGS> component.
- How did life expectancy change over time? What do you observe? Note that many points are plotted on top of each other. This is called “overplotting”. Try a different geom_ function called geom_jitter. It will spread the points apart a little bit using random noise.
Hint: the gapminder dataset has a column called year, which should appear on the x-axis.
- See if you can visualize life expectancy by continent. Which continent tends to have the highest life expectancy (notice the density of the points along the y-axis)? The lowest? Which continent has the largest spread in life expectancy values? How about the smallest?
## Part 1
ggplot(gapminder)+
geom_point(mapping = aes(x=year, y=lifeExp))
# fix overplotting
ggplot(gapminder)+
geom_jitter(mapping = aes(x=year, y=lifeExp))
## Part 2
ggplot(gapminder)+
geom_jitter(mapping = aes(x=continent, y=lifeExp))
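By default geom_jitter() adds noise in both directions, so the plotted life expectancy values are shifted slightly as well. One possible refinement, sketched here under the assumption that we only want to spread the points horizontally, is to set its width and height arguments explicitly (the value 1.5 is an arbitrary choice for this example, and p_jitter is an illustrative name).

```r
library(ggplot2)

gapminder <- gapminder::gapminder

# jitter along the x-axis only: height = 0 leaves lifeExp untouched
p_jitter <- ggplot(data = gapminder) +
  geom_jitter(mapping = aes(x = year, y = lifeExp), width = 1.5, height = 0)
p_jitter
```

Keeping the horizontal jitter smaller than half the gap between years (2.5 here) keeps the points of neighbouring years from mixing.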
What if we want to combine the graphs from the previous two challenges and show the relationship between three variables in the same graph? It turns out that we don’t need a third geometrical dimension: we can simply employ color.
The following graph maps the continent variable from the gapminder dataset to the color aesthetic of the plot. Let’s take a look:
ggplot(data = gapminder) +
geom_jitter(mapping = aes(x = year, y = lifeExp, color=continent))
- What will happen if you switch the mappings of continent and year in the previous example? Is the graph still useful? Why? What if you map the color aesthetic to country? What has changed? How is year different from country? What is the limitation of the color aesthetic when used to visualize different types of data?
- Can you add a little color to our initial graph of life expectancy by GDP per capita? Color the points by continent. There seem to be some outliers in this graph. Can you now spot which continent these points belong to? How about using a color gradient to illustrate change over time?
Hint: you may want to transform GDP per capita to a logarithmic scale before plotting. Just wrap the name of the variable in the log() function.
## Part 1
ggplot(data = gapminder) +
geom_jitter(mapping = aes(x = continent, y = lifeExp, color=year))
# Color by country
ggplot(data = gapminder) +
geom_jitter(mapping = aes(x = continent, y = lifeExp, color=country))
## Part 2
ggplot(data = gapminder) +
geom_jitter(mapping = aes(x = gdpPercap, y = lifeExp, color=continent))
# change over time
ggplot(data = gapminder) +
geom_jitter(mapping = aes(x = gdpPercap, y = lifeExp, color=year))
There are other aesthetics that can come in handy. One of them is size. The idea is that we can vary the size of datapoints to illustrate another continuous variable, such as country population. Let’s look at four dimensions at once!
ggplot(data = gapminder) +
geom_point(mapping = aes(x = log(gdpPercap), y = lifeExp, color=continent, size=pop))
There’s one more useful aesthetic property, called shape, which is good for visualizing low-cardinality categorical variables (categorical variables with a small number of unique values). The idea is that you can employ different shapes (other than circles) to plot the data.
- Blow your mind by visualizing five(!) dimensions in the same graph. Modify the previous example, mapping color to year and shape to continent. What can you say about those Asian outliers: do they belong to small or large countries? Are they from earlier or later time periods?
ggplot(data = gapminder) +
geom_point(mapping = aes(x = log(gdpPercap), y = lifeExp, color=year, shape=continent, size=pop))
Combining too many aesthetics in the same graph can make it quite busy. However, you can always remove certain aesthetic properties and use several graphs to highlight different aspects of the data.
Until now, we explored different aesthetic properties of a graph mapped to certain variables. What if you want to recolor all datapoints or plot them all with a certain shape? In that case the color or shape is no longer mapped to any data, so you need to supply it to the geom_ function as a separate argument (outside of the mapping). Here’s our initial graph with all points colored blue.
ggplot(data = gapminder) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp), alpha=0.1, size=2, color="blue")
Once more, observe that the color is now not mapped to any particular variable from the gapminder dataset and applies equally to all datapoints. Therefore it sits outside the mapping argument and is not wrapped in the aes() function. Note that unmapped colors are supplied as character strings (in quotes), size is a number (size of the point in mm) and shape is the index of the shape in R’s internal vocabulary (where square is 0, circle is 1, triangle is 2 and small filled circle is 20). Explore different shapes by varying the shape number between 0 and 25, or refer to the ggplot2 documentation, called [vignettes](http://docs.ggplot2.org/current/vignettes/ggplot2-specs.html), for details. This document can also be opened from within R by calling vignette("ggplot2-specs").
- Try mapping the color aesthetic to continent and to year. What do you notice? What might be the reason for the different treatment of these variables by ggplot?
- Change the transparency of the datapoints inside and outside the aesthetics. What can be the benefit of each one of these methods?
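A common slip is to put a fixed value inside aes(). The sketch below, which assumes ggplot2 and gapminder are installed and uses the illustrative names p_set and p_mapped, contrasts setting a color with (accidentally) mapping one.

```r
library(ggplot2)

gapminder <- gapminder::gapminder

# setting: color is a fixed value, supplied outside aes()
p_set <- ggplot(data = gapminder) +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp), color = "blue")

# mapping: "blue" inside aes() is treated as data (a category with one value),
# so ggplot2 assigns it a color from its default palette and draws a legend
p_mapped <- ggplot(data = gapminder) +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = "blue"))

p_set
p_mapped
```

If points come out in an unexpected color with a one-entry legend, a value placed inside aes() by mistake is a likely cause.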
## Part 1
ggplot(data = gapminder) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent))
#
ggplot(data = gapminder) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = year))
## Part 2
ggplot(data = gapminder) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp), alpha=0.1)
#
ggplot(data = gapminder) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp, alpha = year))
Next, we will consider different geoms for our ggplot2 graphs. By using different geom_ functions, the user can highlight different aspects of the data. For example, we could connect individual datapoints belonging to the same country into a line and illustrate the development of life expectancy over time for each country separately using the geom_line() function.
Some geom_ functions make use of additional aesthetics, such as the group aesthetic in geom_line(). This aesthetic may not have any meaning in other geoms, but here it allows us to draw multiple lines, one per country. To keep the lines organized, we will color them by continent.
ggplot(data = gapminder) +
geom_line(mapping = aes(x = year, y = lifeExp, group=country, color=continent))
Note how life expectancy suddenly drops for certain countries for short periods of time. Later in this workshop we will learn how to zoom in on those tragic periods of history and investigate which countries experienced them.
Another useful geom function is geom_boxplot(). It adds a layer with a “box and whiskers” plot illustrating the distribution of values within categories. The following chart breaks down life expectancy by continent. The box spans the first and third quartiles (the 25th and 75th percentiles), the middle bar marks the median, and each whisker extends to the most extreme observation within 1.5 times the interquartile range of the box. Observations beyond the whiskers are treated as outliers and shown as separate points.
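To see what the group aesthetic contributes, the following sketch (assuming ggplot2 and gapminder are installed; p_no_group and p_group are illustrative names) builds the line plot with and without it and counts how many separate lines each version draws, using layer_data() to inspect the computed data.

```r
library(ggplot2)

gapminder <- gapminder::gapminder

# without group: all observations of a continent are joined into one zigzag line
p_no_group <- ggplot(data = gapminder) +
  geom_line(mapping = aes(x = year, y = lifeExp, color = continent))

# with group: one line per country
p_group <- ggplot(data = gapminder) +
  geom_line(mapping = aes(x = year, y = lifeExp, group = country, color = continent))

# count the lines each plot draws
length(unique(layer_data(p_no_group)$group))
length(unique(layer_data(p_group)$group))
```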
ggplot(data = gapminder) +
geom_boxplot(mapping = aes(x = continent, y = lifeExp))
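The elements of a box can be reproduced by hand. This is a rough sketch in base R for a single continent, assuming the gapminder package is installed and using the quartile and 1.5 times IQR rule described in the geom_boxplot() help page; all variable names are ours.

```r
gapminder <- gapminder::gapminder

# life expectancy values for one continent
oceania <- gapminder$lifeExp[gapminder$continent == "Oceania"]

# box: first quartile, median and third quartile
q <- quantile(oceania, probs = c(0.25, 0.5, 0.75))
iqr <- q[3] - q[1]

# whiskers: most extreme observations within 1.5 * IQR of the box
lower_whisker <- min(oceania[oceania >= q[1] - 1.5 * iqr])
upper_whisker <- max(oceania[oceania <= q[3] + 1.5 * iqr])

# anything beyond the whiskers would be drawn as an individual point
outliers <- oceania[oceania < lower_whisker | oceania > upper_whisker]

q
c(lower_whisker, upper_whisker)
outliers
```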
Layers can be added on top of each other. In the following graph we will place the boxplots over jittered points to see the distribution of outliers more clearly. We can map two aesthetic properties to the same variable. Here we will also use a different color for each continent.
ggplot(data = gapminder) +
geom_jitter(mapping = aes(x = continent, y = lifeExp, color=continent)) +
geom_boxplot(mapping = aes(x = continent, y = lifeExp, color=continent))
Now, this was slightly inefficient due to duplication of code - we had to specify the same mappings for two layers. To avoid it, you can move common arguments of geom_ functions to the main ggplot() function. In this case every layer will “inherit” the same arguments, specified in the “parent” function.
ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, color=continent)) +
geom_jitter() +
geom_boxplot()
You can still add layer-specific mappings or other arguments by specifying them within individual geoms. We would recommend building each layer separately and then moving common arguments up to the “parent” function.
We can use linear models to highlight how the relationship between GDP per capita and life expectancy differs by continent. Notice that we added a separate argument to the geom_smooth() function to specify the type of model we want ggplot2 to build from the data (a linear model). By default, geom_smooth() also draws a 95% confidence interval around each fitted line (shaded gray area), indicating the uncertainty of the fit. For more information, please refer to the help page (by typing ?geom_smooth).
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp, color=continent)) +
geom_point(alpha=0.5) +
geom_smooth(method="lm")
Notice that we also used a new visual property called alpha to increase the transparency of the data points and make the trend lines stand out. The alpha property can also be used as a mapped aesthetic, i.e. transparency can be made to vary depending on the value of a certain variable.
- Modify the graph above to force R to create a single regression line for all data points. Keep the points colored by continent. Hint: There could be several alternative solutions to this problem.
In the graph above, each geom inherited all three mappings: x, y and color. If we want only a single linear model to be built, we need to limit the effect of the color aesthetic to the geom_point() function by moving it from the “parent” function to the layer where we want it to apply. Note, though, that because we still want the color to be mapped to the continent variable, it needs to be wrapped in the aes() function and supplied to the mapping argument.
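The trend lines above can also be obtained as numbers. As a sketch, assuming the gapminder package is installed, the code below fits the same kind of model with base R's lm() separately for each continent; the object name fits is ours.

```r
gapminder <- gapminder::gapminder

# fit lifeExp ~ log(gdpPercap) separately for each continent,
# mirroring the per-color-group fits drawn by geom_smooth(method = "lm")
fits <- lapply(split(gapminder, gapminder$continent), function(d) {
  lm(lifeExp ~ log(gdpPercap), data = d)
})

# intercept and slope of each trend line
t(sapply(fits, coef))

# 95% confidence interval for the coefficients of one continent
confint(fits[["Europe"]])
```

Comparing the slopes in the resulting table is a more precise way to answer which continent shows the steepest relationship than reading it off the chart.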
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) +
geom_point(mapping=aes(color=continent), alpha=0.5) +
geom_smooth(method="lm")
# Alternative solution
ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp, color=continent)) +
geom_point(alpha=0.5) +
geom_smooth(method="lm", color="black")
As you can observe, the x-axis label of our graph says log(gdpPercap), which indicates that we are not really plotting the original data, but rather the output of the log() function. The same effect (with a slightly more aesthetically pleasing x-axis label) can be achieved by specifying the x-axis scale transformation as a separate layer. Instead of transforming the values, we will transform the scale of the x-axis.
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color=continent)) +
geom_point() +
geom_smooth() +
scale_x_log10()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Now the x-axis is drawn on a log10 scale, with tick labels still in the original units, and the data plotted on this scale look more linear. Certain scale and coordinate functions may result in similar visual effects on the chart, but the way they interact with other aesthetic elements may be quite different. Check out the online ggplot2 documentation for more details and examples of using scale and coordinate transformations.
- Make a boxplot of life expectancy by year. Hint: You may need to do something with the year variable to force it to be categorical, or follow the advice suggested by ggplot. When was the interquartile range of life expectancy the smallest? Make the same plot of gdpPercap (on a log scale) per year. Compared to 1952, is the world today more or less diverse in terms of IQR of GDP per capita?
- Make a histogram of untransformed and transformed gdpPercap. What is the shape of the distribution? Why is the bins parameter important for interpretation of the histogram? Build a density function. How would you compare density functions of different continents?
- Based on the graph produced using the geom_density2d() function of log GDP per capita vs life expectancy, how many clusters of datapoints can you identify? What if you look at it by continent?
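To confirm that a scale transformation changes where the points are placed but not the data themselves, one can inspect the positions ggplot2 computes. This is a small sketch assuming ggplot2 and gapminder are installed; p_log_scale and built_x are illustrative names.

```r
library(ggplot2)

gapminder <- gapminder::gapminder

p_log_scale <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# x positions computed by ggplot2 for the point layer
built_x <- layer_data(p_log_scale)$x

# they agree with log10() of the original values,
# while the gapminder data frame itself is left unchanged
all.equal(sort(built_x), sort(log10(gapminder$gdpPercap)))
```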
## Part 1
# force year to become categorical
ggplot(gapminder)+
geom_boxplot(mapping = aes(y=lifeExp, x=as.character(year))) # simple x=year will not work
# ggplot suggested solution
ggplot(gapminder)+
geom_boxplot(mapping = aes(y=lifeExp, x=year, group=year))
# gdpPercap
ggplot(gapminder)+
geom_boxplot(mapping = aes(y=gdpPercap, x=year, group=year))+
scale_y_log10()
## Part 2
ggplot(gapminder)+
geom_histogram(mapping = aes(x=gdpPercap))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# on log scale with higher number of bins
ggplot(gapminder)+
geom_histogram(mapping = aes(x=gdpPercap),bins=100) +
scale_x_log10()
# density
ggplot(gapminder)+
geom_density(mapping = aes(x=gdpPercap)) +
scale_x_log10()
# by continent
ggplot(gapminder)+
geom_density(mapping = aes(x=gdpPercap, color=continent)) +
scale_x_log10()
## Part 3
# Density 2d
ggplot(gapminder)+
geom_density2d(mapping = aes(x=gdpPercap, y=lifeExp)) +
scale_x_log10()
# by continent
ggplot(gapminder)+
geom_density2d(mapping = aes(x=gdpPercap, y=lifeExp, color=continent)) +
scale_x_log10()
Multi-layered graphs employing several aesthetics can look crowded. To avoid this, one can split the data into panels of similar graphs. In ggplot2 this method is called “faceting”. Let’s facet the graph above by continent and show the datapoints and the trend for each continent in a separate panel.
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth() +
scale_x_log10() +
facet_wrap( ~ continent)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The facet_wrap() layer takes a “formula” as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the continent column of the gapminder dataset. Faceting is useful when the number of panels is limited. Notice that here R places panels from left to right, “wrapping” those panels that do not fit in one row onto a new row. Learn about advanced faceting, including faceting over several variables, by typing ?facet_grid.
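As an illustration of faceting over two variables, here is a minimal sketch with facet_grid(), assuming ggplot2 and gapminder are installed. The choice of the years 1952 and 2007 and the names two_years and p_grid are ours, made only to keep the grid small.

```r
library(ggplot2)

gapminder <- gapminder::gapminder

# keep the first and last year so the grid stays small
two_years <- subset(gapminder, year %in% c(1952, 2007))

# rows are continents, columns are years
p_grid <- ggplot(data = two_years, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_grid(continent ~ year)
p_grid
```

In the formula, the variable to the left of the tilde defines the rows and the variable to the right defines the columns.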
Reiterating our previously proposed ggplot2 template and adding what we have learned so far, we can state:
`ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)) +
<FACET_FUNCTION>`
- Try faceting by year, keeping the linear smoother. Is there any change in slope of the linear trend over the years? What if you look at linear models per continent?
# by year
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_log10() +
facet_wrap( ~ year)
# by continent
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_log10() +
facet_wrap( ~ continent)
There are only two countries from Oceania included in the dataset, but other continents have many more. If we draw one vertical bar per country, the bars become too narrow and the labels unreadable. One way of dealing with this is flipping the coordinate system, i.e. plotting the same data as horizontal bars. Let’s try to show the population of every Asian country in 2007.
ggplot(filter(gapminder, year==2007 & continent=="Asia")) +
geom_bar(mapping = aes(x=country, y=pop), stat="identity") +
coord_flip()
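The same chart can be written a little more compactly. The sketch below is one possible variant, assuming ggplot2 and gapminder are installed: it uses geom_col(), which is shorthand for geom_bar(stat = "identity"), and additionally sorts the bars by population with reorder(). The names asia_2007 and p_bars are illustrative.

```r
library(ggplot2)

gapminder <- gapminder::gapminder

# base R subset() is used here so the example does not depend on dplyr::filter()
asia_2007 <- subset(gapminder, year == 2007 & continent == "Asia")

# reorder() sorts the bars by population instead of alphabetically
p_bars <- ggplot(data = asia_2007) +
  geom_col(mapping = aes(x = reorder(country, pop), y = pop)) +
  coord_flip() +
  labs(x = "country")
p_bars
```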
There are many functions related to coordinate systems that allow, among other things, plotting in non-cartesian (e.g. polar and Mercator) coordinates and specifying manual limits for the coordinate axes.
Lastly, we will learn how to label and annotate the chart using the labs() and annotate() functions.
ggplot(data = gapminder) +
geom_point(mapping = aes(x = gdpPercap/10^3, y = lifeExp, size=pop/10^6, color=continent)) +
scale_x_log10() +
facet_wrap(~year) +
labs(title="Life Expectancy vs GDP per capita over time",
subtitle="In the past 50 years, life expectancy has improved in most countries of the world",
caption="Source: Gapminder foundation, https://www.gapminder.org/data/",
x="GDP per capita, '000 USD",
y="Life expectancy, years",
color="Continent",
size="Population, mln")
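While labs() controls titles and axis labels, annotate() places a single piece of text (or another shape) at fixed coordinates. The following is a minimal sketch, assuming ggplot2 and gapminder are installed; the label text, its position and the names gapminder_2007 and p_annotated are ours, chosen for illustration.

```r
library(ggplot2)

gapminder <- gapminder::gapminder
gapminder_2007 <- subset(gapminder, year == 2007)

p_annotated <- ggplot(data = gapminder_2007) +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  scale_x_log10() +
  # annotate() takes fixed values, not a mapping; x and y are in data units
  annotate("text", x = 1000, y = 80, label = "Each point is one country in 2007") +
  labs(x = "GDP per capita, USD", y = "Life expectancy, years", color = "Continent")
p_annotated
```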
The graph produced in the previous section looks quite good, but it requires the reader to follow the time aspect of the data by tracing the changes across panels. This may be better illustrated by “animating” the time dimension of the data and playing the twelve charts in front of the user one after another.
# devtools::install_github("thomasp85/gganimate")
library(gganimate)
ggplot(data = gapminder, aes(x = gdpPercap/10^3, y = lifeExp, size=pop/10^6, color=continent)) +
geom_point() +
geom_text(mapping = aes(x = 40, y = 30, label=as.character(floor(year))), size = 16, color="grey") +
scale_x_log10() +
labs(title="Life Expectancy vs GDP per capita over time",
subtitle="In the past 50 years, life expectancy has improved in most countries of the world",
caption="Source: Gapminder foundation, https://www.gapminder.org/data/",
x="GDP per capita, '000 USD",
y="Life expectancy, years",
color="Continent",
size="Population, mln") +
theme_bw() +
# here comes a gganimate function
transition_time(year)+
ease_aes('linear')
We conclude this lesson by reiterating our ggplot2 data visualization template.
`ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>),
stat = <STAT>) +
<SCALE_FUNCTION> +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION> +
<LABS>`
We learned about several parameters of ggplot functions, shown in the template above. However, it is very rare that all of them need to be specified in a given graphic or chart. Most of the time ggplot offers useful defaults for everything other than data, geoms and mappings.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
Still bored?
- Use several graphs and necessary filters to narrow down your search to those few outliers with high gdpPercap. What are those countries and in which years? What might be the reason?
Hint: You may want to experiment with geom_text() to get the country labels to show on the chart.
- Use several graphs and necessary filters to narrow down your search to those few outliers with extraordinarily low life expectancy. What are those countries and in which years? What might be the reason?
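One possible starting point for the first of these challenges is sketched below, assuming ggplot2 and gapminder are installed. The threshold of 50000 is an arbitrary cut-off chosen for this example, and the names rich_outliers and p_labels are illustrative.

```r
library(ggplot2)

gapminder <- gapminder::gapminder

# keep only the observations with very high GDP per capita
rich_outliers <- subset(gapminder, gdpPercap > 50000)

p_labels <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  # the data argument restricts this layer to the filtered rows
  geom_text(data = rich_outliers,
            mapping = aes(label = paste(country, year)),
            size = 3, vjust = -1)
p_labels
```

The same pattern, with a filter on lifeExp instead of gdpPercap, can serve as a starting point for the second challenge.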
When you are working with data from different countries, it might also be a good idea to use maps to convey your data in a familiar way. ggplot2 has a new geom called geom_sf that will help you plot maps and use aesthetics in the same way as in other geoms.
We have downloaded a world data file, called a shapefile, from thematicmapping.org and will use it to create maps. In this case, we use the entire downloaded folder as a source, and an R package called sf knows how to read it as a map coordinate system. The code below loads equivalent country polygons from the rnaturalearth package instead; the st_read() call for the shapefile is left commented out.
# install.packages("rnaturalearth")
# world <- rnaturalearth::ne_countries(returnclass = "sf") # may be easier to explain
# ggplot() +
# geom_sf(data = world) +
# theme_bw()
#world_map <- sf::st_read("data/TM_WORLD_BORDERS-0.3/TM_WORLD_BORDERS-0.3.shp", quiet = TRUE)
world_map <- rnaturalearth::ne_countries(returnclass = "sf")
ggplot(world_map) +
geom_sf(aes(fill = pop_est))+
scale_fill_viridis_c()+
theme_bw()